System and method for real-time automatic video key frame analysis for critical motion in complex athletic movements

ABSTRACT

A system and method train a neural network to process video for the identification of critical points of motion in a movement. Embodiments include a process for training the neural network and a process for identifying the video frames including a critical point based on the criteria of the neural network. An exemplary embodiment includes a software application which may be accessible through a computing device to analyze videos on site.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application having Ser. No. 63/083,286 filed Sep. 25, 2020, which is hereby incorporated by reference herein in its entirety.

FIELD

The subject disclosure relates to video analysis processes, and more particularly, to a system and method for real-time automatic video key frame analysis for critical motion in complex athletic movements.

BACKGROUND

Video analysis of bio-mechanical movement is becoming very popular for applications such as sports motion and physical therapy. However, identifying which frames in a video include the content of most significance can be a challenge when poring through hundreds of frames in a sequence. Trying to do so quickly is prone to significant error. Conventionally, a user would try to freeze frame different points in a general section of frames to show, for example, a human subject where the body was deficient in a movement. With the proliferation of mobile computing devices, video can be analyzed remotely. The analyzer can extract select frames to show the subject user the flaws in mechanics. However, single frames lack context and seeing the action in real-time or at a high frame rate is more desirable.

Some approaches have attempted to automate the process of identifying critical points in motion. Previous solutions may use, for example, pose estimation to analyze movement, which is a flawed technique and is computationally slow and inefficient. Pose estimation is prone to errors arising from lighting, background, and clothing type. Using poses as intermediate data (a feature vector) for other models injects human intelligence into the process, which has significant issues. Moreover, the processing power needed makes it nearly impossible to run analysis using pose estimation in real-time at high frame-rates on a mobile device.

SUMMARY

In one aspect of the disclosure, a method for identifying critical points of motion in a video is disclosed. The method includes using a feature extraction neural network trained to identify features among a plurality of images of movements. The features are associated with a critical point of movement. A video sequence is received including a plurality of video frames comprising a motion captured by a camera. The feature extraction neural network identifies individual frames from the plurality of video frames. The identified individual frames include a known critical point in the captured motion based on identified features associated with a critical point of movement. The identified frames which show the critical points in the captured motion are displayed.

It is understood that other configurations of the subject technology will become readily apparent to those skilled in the art from the following detailed description, wherein various configurations of the subject technology are shown and described by way of illustration. As will be realized, the subject technology is capable of other and different configurations and its several details are capable of modification in various other respects, all without departing from the scope of the subject technology. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of a process for generating an artificial intelligence module configured to analyze video for critical motion according to an illustrative embodiment of the subject technology.

FIG. 2 is a diagrammatic view of a system capturing a golf swing for use in critical motion analysis according to an illustrative embodiment of the subject technology.

FIG. 3 is a diagrammatic view of a system capturing a weightlifting movement for use in critical motion analysis according to an illustrative embodiment of the subject technology.

FIG. 4 is a diagrammatic view of a user viewing a software embodiment of the subject technology displaying critical motion analysis according to an illustrative embodiment of the subject technology.

FIG. 5 is a flowchart of a method of extracting critical motion video frames from a video according to an illustrative embodiment of the subject technology.

FIG. 6 is a flowchart of a method of automated identification of frames for video analysis of critical motion according to an illustrative embodiment of the subject technology.

FIG. 7 is a flowchart of a computer implemented method for converting manual labels into machine readable labels according to an illustrative embodiment of the subject technology.

FIG. 8 is a block diagram of an architecture for generating real-time automatic video key frame analysis for critical motion in complex athletic movements according to an illustrative embodiment of the subject technology.

FIG. 9 is a block diagram of a computing device for analyzing critical motion in complex athletic movements according to an illustrative embodiment of the subject technology.

DETAILED DESCRIPTION

The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology may be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, it will be apparent to those skilled in the art that the subject technology may be practiced without these specific details. Like or similar components are labeled with identical element numbers for ease of understanding.

In general, and referring to the Figures, illustrative embodiments of the subject technology provide processes which automate the identification of critical points of motion in a video that captured a sequence of movements. Aspects of the subject technology provide training for an artificial intelligence (A.I.) or machine learning module. The embodiments described below are capable of engineering training data in such a way that builds a neural network configured to learn and identify the optimal features from a video sequence, which is far superior to using for example, pose-estimation. It is computationally faster and significantly less prone to error. In another aspect, the automated process generates the display of video frames from a video that show the critical points of motion in a movement sequence.

In an illustrative embodiment, aspects of the subject technology may be applied to analysis of a bio-mechanical movement. For example, in sports, specific sequences of movements have the same general motion but identifying the inefficient placement of one body part to another during the motion is a challenge. Generally, there are certain points in the motion that are more critical to generating an outcome of success from the movement than others. Finding the video frames that show these points in the movement when one subject person's body is different than another is susceptible to error when done manually or by other techniques such as pose estimation. Aspects of the subject technology improve the accuracy in identifying the critical points in videos between different subjects. Some embodiments include additional analysis once critical frames are found. Embodiments may include a finely tuned neural network that identifies body parts for a particular critical frame (accuracy being key for body part identification). Body parts may then be evaluated for positioning deviation from an ideal position. An ideal position may, in some applications, be associated with generating an optimal characteristic (for example, maximum power, successful contact, successful object path after contact, among other characteristics depending on the movement). Other analysis including for example, finding objects (golf club, barbell, projectile) may also be performed on a specific critical frame giving more insight into user/player performance. For example, the position of a golf club may be considered crucial to performance in some critical frames.

FIG. 1 shows a method of training an A.I. component of the system according to an illustrative embodiment. The method may include initial steps using expert analysis to generate a training set which the A.I. component uses as a baseline to begin automatically identifying video frames showing critical points of motion when presented with a video sequence. For example, step 10 may include a data labeling process. In the data labeling process, experts may select for labeling the exact frame in a video where a critical motion takes place.

Step 20 may include automatic key frame labeling. In the keyframe labeling, the initial labeled data may be engineered using an automatic process that converts it from human labels to labels better understood by a machine. In some embodiments, image similarity scoring using, for example, a structural similarity index may be included prior to training. In some embodiments, using the labeled data, an automatic process may determine the optimal statistical range within which other images should be counted as a keyframe for training. For example, the unlabeled video frames may be scored based on an image similarity algorithm on a scale of [0, 1], with 1 being exactly the same. Other embodiments may determine the median of the distribution after removing outliers. Another embodiment may compute the mean/standard deviation and discard anything outside 1 standard deviation.
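As a minimal sketch of the mean/standard deviation variant mentioned above, the following Python uses only the standard library. The function name `within_one_std`, the choice of the population standard deviation, and the sample scores are assumptions made for illustration; the sketch presumes similarity scores have already been computed for the frames near one key frame and is not the disclosed implementation.

```python
import statistics

def within_one_std(scores):
    """Keep the scores that lie within one standard deviation of the mean."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)  # population std. dev. (an assumption)
    low, high = mean - std, mean + std
    return [s for s in scores if low <= s <= high]

# Hypothetical similarity scores in [0, 1] for frames near one key frame.
scores = [0.90, 0.88, 0.92, 0.91, 0.35, 0.89]
kept = within_one_std(scores)
print(kept)
```

With these hypothetical values, the single low score is discarded and the five tightly grouped scores are retained as candidates for the key frame label.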

Step 30 may include training feature extraction. In training for feature extraction, the engineered labels may then be used to train a feature extraction neural network. A custom constructed neural network may be generated which is ideal for extracting high level features capable of distinguishing critical points of motion from an image. Embodiments may integrate a unique and custom loss function which the model uses during training to evaluate itself and learn an optimal maximum. Through trial and error, and research, the optimal amount of stochastic augmentations which may be applied to input images may be found. This step may alter images just enough to challenge the neural network to learn better. Too little and the network may not reach an optimal maximum; too much and the network may not learn anything.
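The trade-off between too little and too much augmentation can be pictured with a single strength setting. The sketch below is an assumption-laden illustration: the disclosure does not name specific augmentations, so the brightness shift, the horizontal flip, the `strength` parameter, and the function name `augment` are all hypothetical stand-ins.

```python
import random

def augment(image, strength, rng):
    """Apply a random brightness shift and a random horizontal flip.

    `image` is a list of rows of grayscale values in [0, 1]. `strength`
    in [0, 1] is a hypothetical knob scaling how aggressive the change is.
    """
    shift = rng.uniform(-strength, strength)   # brightness offset
    flip = rng.random() < 0.5 * strength       # flip more often when stronger
    out = []
    for row in image:
        new_row = [min(1.0, max(0.0, px + shift)) for px in row]
        out.append(new_row[::-1] if flip else new_row)
    return out

rng = random.Random(0)  # seeded so the sketch is repeatable
image = [[0.1, 0.5, 0.9], [0.2, 0.6, 1.0]]
unchanged = augment(image, 0.0, rng)   # strength 0 leaves the image as-is
perturbed = augment(image, 0.3, rng)   # moderate perturbation
```

In practice, the strength would be one of the values tuned by trial and error, as the paragraph above describes.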

Step 40 may include training a temporal network. The feature extraction network may be used to train the temporal model. In some embodiments, a custom-built neural network may be developed, ideal for understanding high level features over time. These features may be extracted from the neural network in the previous step. In this step the same loss function may be utilized to boost training.

The final output of the above process may be a neural network embodiment which may be executed/performed by computer hardware in real-time to analyze videos. The neural network takes in sequences of images, for example, video, and then identifies the best image from the set that matches a desired critical motion to analyze further. Once the neural network is trained, some embodiments may include a software embodiment which executes the processes of identifying the frames with critical points of motion for display to an end user.

FIGS. 2 and 3 show example applications using aspects of the subject technology. In these figures, two different sports movements are being performed. In FIG. 2, a subject 60 is performing a golf swing. In some embodiments, the subject 60's swing may be captured by a digital camera 50. The captured video may be transferred to another computing device which runs a software embodiment that identifies critical motion and displays one or more frames. FIG. 3 shows a subject 80 performing a weightlifting motion. Some embodiments may include the software loaded onto a mobile computing device 70 (for example, a smart phone, portable tablet, wearable smart device, etc.) which includes a camera and may include a digital display. The user may run the software embodiment on the captured video. The software may extract from the captured video a video frame with a critical point of motion for display on the computing device.

FIG. 4 shows an example use case where video captured previously (for example, by the camera 50 or mobile computing device 70) is loaded onto a computing device with a display. A plurality of frames 100 from a video sequence of a golf swing are shown simultaneously on the display in temporal order. The software may identify a frame 110 as showing a critical point in the motion of a golf swing. Identification of the frame 110 may be highlighted in some way for a user 90 to understand which is the frame that requires attention. The user 90 may then look at the frame 110 to analyze further for issues, if any, present in the swing motion.

FIG. 5 shows a process 500 of automated video frame processing according to an illustrative embodiment of the subject technology. The process 500 may include receiving 510 video sequencing data of a person's motion captured by a camera. Individual images from the video sequence may be processed 520 for feature extraction by a neural network. The extracted features are forwarded 530 to the temporal neural network. The output from the temporal neural network model may be converted 540 to timestamps from the original video sequence.
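The final conversion 540 can be sketched as follows, assuming (hypothetically) that the temporal model emits one score per frame for each critical point and that the video has a constant frame rate. The function name `frames_to_timestamps`, the critical point names, the scores, and the 60 frames-per-second rate are illustrative assumptions, not details taken from the disclosure.

```python
def frames_to_timestamps(scores_per_point, fps):
    """Map each critical point to the timestamp (seconds) of its best frame."""
    timestamps = {}
    for name, scores in scores_per_point.items():
        # Index of the highest-scoring frame for this critical point.
        best_index = max(range(len(scores)), key=scores.__getitem__)
        timestamps[name] = best_index / fps
    return timestamps

# Hypothetical per-frame scores for two critical points of a golf swing.
scores = {
    "top_of_backswing": [0.1, 0.2, 0.9, 0.3, 0.1, 0.0],
    "impact": [0.0, 0.1, 0.1, 0.2, 0.8, 0.4],
}
result = frames_to_timestamps(scores, fps=60.0)
```

Here the third frame (index 2) and the fifth frame (index 4) are selected, and each index is divided by the frame rate to locate the frame in the original video sequence.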

FIG. 6 shows a process 600 for automated identification of frames for video analysis of critical motion according to an illustrative embodiment. Raw video sequence data may be retrieved 605 from a camera. In some embodiments, frames in the raw video data may be selected and/or labeled 610 with timestamps. The frames selected may be chosen based on showing a critical motion within a movement sequence inside the frame. The labels may be human determined. A processor converts 615 the human determined labels into machine readable labels. The processor may train 620 a neural network for feature extraction using the machine determined labels. The features used for training the neural network may be from the raw video frames and in some embodiments, other video. Output data from the feature extractor neural network may be used to train 625 the temporal neural network.

In general, the feature extractor module converts a single two-dimensional image into a compressed representation optimal for temporal training. The temporal model uses that representation to then learn underlying sequence of events and when the critical frames occur. The temporal neural network is being trained on the input feature vectors from the feature extractor neural network. Then the temporal neural network is predicting the critical key frames using these feature vectors. Training is split into two steps in some embodiments because the computational power/memory requirement to construct a single monolithic neural network that goes from video images directly to critical frames may be too heavy for some computing systems. The subject approach is highly optimized since the process(es) require significantly less computational power/memory and is significantly faster to construct. This may be done by splitting training into two steps (feature extractor and temporal model). The feature extractor neural network essentially transforms/maps an image to a latent (hidden space) representation. The latent space is a compressed representation that optimizes training for the temporal model. The temporal model uses the latent space representation to predict the critical frames. The temporal model also learns the underlying temporal patterns between critical frames; for example, the sequence of events and the time that passes between critical frames.

FIG. 7 shows a computer implemented method 700 for converting human labels into machine readable labels according to an illustrative embodiment. The processor receives 710 human labeled video frames. For a unique timestamp, a video frame is extracted 710. The extracted video frame is labeled 720 and placed in a labeled category. Some or all of the remaining frames within the same timestamp are extracted and placed 730 into an unlabeled category. A temporal distance may be calculated 740 between all non-labeled frames in a video and the labeled critical key frames. Unlabeled frames are associated 750 with the labeled frame from which they have the shortest temporal distance. For each labeled frame and its associated unlabeled frames, the processor generates 760 an image comparison score based on similarity. The similarity score may be calculated using different algorithms. For example, one possible similarity algorithm may be based on determining a normalized root mean squared error. Another algorithm may be based on a structural similarity index. As will be understood, other similarity score bases may be used without departing from the scope of the disclosed embodiments. In some embodiments, scores may be aggregated and statistically analyzed 770 to find optimal ranges. For each labeled frame and associated unlabeled frames, if an unlabeled frame similarity score falls within a statistical range, the unlabeled frame may be labeled 780. Any remaining unlabeled frames may be placed 790 into a no-label category.
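The association 750 and scoring 760 steps can be sketched as below. This is a simplified illustration under stated assumptions: frames are represented as short flattened lists of grayscale values in [0, 1], temporal distance is measured in frame indices, and the root mean squared error is normalized by the pixel value range (one of several possible normalizations). The names `nearest_key_frame` and `nrmse` and all data values are hypothetical.

```python
import math

def nearest_key_frame(frame_index, key_indices):
    """Return the labeled key frame index with the shortest temporal distance.

    Ties go to the earlier key frame in `key_indices` (an arbitrary choice).
    """
    return min(key_indices, key=lambda k: abs(k - frame_index))

def nrmse(a, b, value_range=1.0):
    """Root mean squared error of two equal-length pixel lists, divided by
    the pixel value range (assumed normalization)."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.sqrt(mse) / value_range

# Hypothetical video: six flattened grayscale frames with values in [0, 1].
frames = [
    [0.0, 0.0], [0.1, 0.1], [0.2, 0.2],
    [0.8, 0.8], [0.9, 0.9], [1.0, 1.0],
]
key_indices = [1, 4]  # frames labeled by a human expert (hypothetical)

scores = {}
for i, frame in enumerate(frames):
    if i in key_indices:
        continue  # labeled frames are not scored against themselves
    k = nearest_key_frame(i, key_indices)
    scores[i] = (k, nrmse(frame, frames[k]))
```

Each unlabeled frame ends up paired with its closest key frame and a dissimilarity score, which is the input the statistical analysis 770 would aggregate.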

The statistical range for determining whether an unlabeled frame should be converted into a labeled frame may be based on a distribution curve of all the similarity values calculated for a specific keyframe. For example, a normalized root mean squared error may be used as a similarity algorithm. Using this algorithm, comparing the specific keyframe to itself would produce a value of 0.0 because the two images are exactly equal. As a particular frame gets further away in time from the keyframe, the similarity value may go up, stopping at 1.0 because it is normalized from 0.0 to 1.0. The system may collect all these values for a specific keyframe across all the videos in the training dataset. Assuming, for example, there are 2000 similarity metrics for a keyframe, a statistical analysis may include removing outliers. For simplicity's sake in explaining the process, for a process that includes a basic outlier removal, the median value of the list of similarity metrics may be determined. If the median value is, for example, 0.25, this value may be used to determine which frames may receive a label from those unlabeled frames associated with a keyframe. Based on the similarity algorithm, the higher the value, the more dissimilar the frame is. So, any frame with a score greater than 0.25 (as an example using the illustrative value above) is placed into the no-label category. Conversely, if the score is less than or equal to 0.25, it is labeled as a keyframe.
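The median rule in this example can be worked through with a handful of values. The sketch below is illustrative only: it uses five hypothetical scores rather than the roughly 2000 collected across a training dataset, it omits the outlier removal step, and the function name `label_by_median` and the frame numbers are assumptions.

```python
import statistics

def label_by_median(scores):
    """Split NRMSE-style scores (0.0 = identical) at their median value."""
    threshold = statistics.median(scores.values())
    keyframe = sorted(f for f, s in scores.items() if s <= threshold)
    no_label = sorted(f for f, s in scores.items() if s > threshold)
    return threshold, keyframe, no_label

# Hypothetical scores for unlabeled frames associated with one keyframe.
scores = {10: 0.05, 11: 0.10, 12: 0.25, 13: 0.40, 14: 0.70}
threshold, keyframe, no_label = label_by_median(scores)
```

With these values the median is 0.25, so frames 10, 11, and 12 receive the keyframe label, while frames 13 and 14 fall into the no-label category.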

During the development of one embodiment, domain experts labelled a start and end point for a critical time period in a video. This resulted in inconsistent data due to humans' inability to consistently perceive tiny visual changes in images across many video sequences. These aggregated inconsistencies can result in too much randomness, which is difficult to model statistically. As will be appreciated, the automated portions of the subject technology (for example, via computer recognition) are more adept at identifying tiny visual changes and can consistently manage these across a wide variety of video sequences. This produces a statistically stable distribution.

As will be appreciated, users may benefit from the various aspects of the subject technology when applied to, for example, analyzing the motion of a movement to identify flaws or inefficiencies in the movement (for example, hitches, improper alignment of a body part, etc.). The output is provided nearly instantaneously in a software application embodiment which displays a frame with a critical point of motion right after the video is taken. Users are able to use the application anywhere and may do so on site where they are moving/performing a motion. For example, golfers may video their golf stroke during a round and, through the subject technology, see on display why their shot went awry.

FIG. 8 illustrates an example architecture 800 for generating real-time automatic video key frame analysis for critical motion in complex athletic movements. Architecture 800 includes a network 806 that allows various computing devices 802(1) to 802(N) to communicate with each other, as well as other elements that are connected to the network 806, such as an image input data source 812, an image analytics service server 816, and the cloud 820.

The network 806 may be, without limitation, a local area network (“LAN”), a virtual private network (“VPN”), a cellular network, the Internet, or a combination thereof. For example, the network 806 may include a mobile network that is communicatively coupled to a private network, sometimes referred to as an intranet that provides various ancillary services, such as communication with various application stores, libraries, and the Internet. The network 806 allows the image critical points analytics engine 810, which is a software program running on the image analytics service server 816, to communicate with the image input data source 812, computing devices 802(1) to 802(N), and the cloud 820, to provide analysis of critical points in complex movements. In one embodiment, the data processing is performed at least in part on the cloud 820.

For purposes of later discussion, several user devices appear in the drawing, to represent some examples of the computing devices that may be the source of image data. Image data may be in the form of video sequence data files that may be communicated over the network 806 with the image critical points analytics engine 810 of the image analytics service server 816. Today, user devices typically take the form of portable handsets, smart-phones, tablet computers, personal digital assistants (PDAs), and smart watches, which generally include cameras integrated into their respective device packaging.

For example, a computing device (e.g., 802(N)) may send a request 103(N) to image critical points analytics engine 810 to analyze the features of video sequence data captured by the computing device 802(N), so that critical points in the motion captured are identified and analyzed for deviation from an ideal form/positioning.

While the image input data source 812 and image critical points analytics engine 810 are illustrated by way of example to be on different platforms, it will be understood that in various embodiments, the image input data source 812 and the image analytics service server 816 may be combined. In other embodiments, these computing platforms may be implemented by virtual computing devices in the form of virtual machines or software containers that are hosted in a cloud 820, thereby providing an elastic architecture for processing and storage.

As discussed above, functions relating to critical point motion analysis of the subject disclosure can be performed with the use of one or more computing devices connected for data communication via wireless or wired communication, as shown in FIG. 8. FIG. 9 is a functional block diagram illustration of a computer hardware platform that can communicate with various networked components, such as a training input data source, the cloud, etc. In particular, FIG. 9 illustrates a network or host computer platform 900, as may be used to implement a server, such as the image analytics service server 816 of FIG. 8.

The computer platform 900 may include a central processing unit (CPU) 904, a hard disk drive (HDD) 906, random access memory (RAM) and/or read only memory (ROM) 908, a keyboard 910, a mouse 912, a display 914, and a communication interface 916, which are connected to a system bus 902.

In one embodiment, the HDD 906 has capabilities that include storing a program that can execute various processes, such as the neural network engine 940, in a manner described herein. The neural network engine 940 may be part of the image analytics service server 816 of FIG. 8. Generally, the neural network engine 940 may be configured to analyze image data for critical points during a complex movement under the embodiments described above. The neural network engine 940 may have various modules configured to perform different functions. In some embodiments, the neural network engine 940 may include sub-modules, for example, a feature extraction module 942, an image temporal analyzer 944, a temporal distance calculator 946, an image similarity engine 948, an image label categorization module 950, and a machine learning module 956. The functions of these sub-modules have been discussed above in the various flowcharts of, for example, FIGS. 1, 5, 6, and 7.

In another embodiment, the computer platform 900 may represent an end user computer (for example, computing devices 802(1) to 802(N)) of FIG. 8. In this context, the various end user computer platforms 900 may include mixed configurations of hardware elements and software packages.

As will be appreciated by one skilled in the art, aspects of the disclosed invention may be embodied as a system, method or process, or computer program product. Accordingly, aspects of the disclosed invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module,” or “system.” Furthermore, aspects of the disclosed invention may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.

Any combination of one or more computer readable media may be utilized. In the context of this disclosure, a computer readable storage medium may be any tangible or non-transitory medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.

Aspects of the disclosed invention are described below with reference to block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. The previous description provides various examples of the subject technology, and the subject technology is not limited to these examples. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more. Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the invention.

A phrase such as an “aspect” does not imply that such aspect is essential to the subject technology or that such aspect applies to all configurations of the subject technology. A disclosure relating to an aspect may apply to all configurations, or one or more configurations. An aspect may provide one or more examples. A phrase such as an aspect may refer to one or more aspects and vice versa. A phrase such as an “embodiment” does not imply that such embodiment is essential to the subject technology or that such embodiment applies to all configurations of the subject technology. A disclosure relating to an embodiment may apply to all embodiments, or one or more embodiments. An embodiment may provide one or more examples. A phrase such as an embodiment may refer to one or more embodiments and vice versa. A phrase such as a “configuration” does not imply that such configuration is essential to the subject technology or that such configuration applies to all configurations of the subject technology. A disclosure relating to a configuration may apply to all configurations, or one or more configurations. A configuration may provide one or more examples. A phrase such as a configuration may refer to one or more configurations and vice versa.

The word “exemplary” is used herein to mean “serving as an example or illustration.” Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. 

What is claimed is:
 1. A method for identifying critical points of motion in a video, comprising: using a feature extraction neural network trained to identify features among a plurality of images of movements, wherein the features are associated with a critical point of movement; receiving a video sequence including a plurality of video frames comprising a motion captured by a camera; identifying, by the feature extraction neural network, individual frames from the plurality of video frames, wherein the identified individual frames include a known critical point in the captured motion based on identified features associated with a critical point of movement; and displaying the identified frames which show the critical points in the captured motion.
 2. The method of claim 1, further comprising training a temporal neural network based on input from the feature extraction neural network.
 3. The method of claim 2, further comprising converting an output of the temporal neural network model into time stamps of the plurality of video frames.
 4. A method of automated identification of frames for video analysis of critical motion, comprising: receiving a raw sequence of video data; receiving selections of the raw sequence of video data, wherein the selections include manual entered labels; automatically converting the manual entered labels to machine readable labels; training a neural network feature extractor based on the machine readable labels to identify video frames that include critical points of motion.
 5. The method of claim 4, further comprising training a temporal neural network based on an output of the neural network feature extractor.
 6. The method of claim 4, further comprising: extracting a selected video frame for each unique timestamp; labeling the extracted video frames; grouping remaining unlabeled video frames; and calculating a temporal distance between labeled video frames and unlabeled video frames.
 7. The method of claim 6, further comprising associating unlabeled video frames with a labeled video frame based on a shortest temporal distance between each unlabeled video frame to the labeled video frames.
 8. The method of claim 7, further comprising generating a similarity score for each labeled video frame and its associated unlabeled video frames based on an image comparison of content in the labeled video frame and the labeled video frame's associated unlabeled video frames.
 9. The method of claim 8, further comprising labeling one of the unlabeled video frames in the event the similarity score for the labeled video frame and the labeled video frame's associated unlabeled video frames falls within a threshold range of values.
 10. The method of claim 9, further comprising using the unlabeled video frames that have been labeled in training the neural network feature extractor to identify video frames that include critical points of motion.
 11. A computer program product for automated identification of frames for video analysis of critical motion, the computer program product comprising: one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media, the program instructions comprising: receiving a raw sequence of video data; receiving selections of the raw sequence of video data, wherein the selections include manual entered labels; automatically converting the manual entered labels to machine readable labels; training a neural network feature extractor based on the machine readable labels to identify video frames that include critical points of motion.
 12. The computer program product of claim 11, wherein the program instructions further comprise training a temporal neural network based on an output of the neural network feature extractor.
 13. The computer program product of claim 11, wherein the program instructions further comprise: extracting a selected video frame for each unique timestamp; labeling the extracted video frames; grouping remaining unlabeled video frames; and calculating a temporal distance between labeled video frames and unlabeled video frames.
 14. The computer program product of claim 13, wherein the program instructions further comprise associating unlabeled video frames with a labeled video frame based on a shortest temporal distance between each unlabeled video frame to the labeled video frames.
 15. The computer program product of claim 14, wherein the program instructions further comprise generating a similarity score for each labeled video frame and its associated unlabeled video frames based on an image comparison of content in the labeled video frame and the labeled video frame's associated unlabeled video frames.
 16. The computer program product of claim 15, wherein the program instructions further comprise labeling one of the unlabeled video frames in the event the similarity score for the labeled video frame and the labeled video frame's associated unlabeled video frames falls within a threshold range of values.
 17. The computer program product of claim 16, wherein the program instructions further comprise using the unlabeled video frames that have been labeled in training the neural network feature extractor to identify video frames that include critical points of motion.